import sys
sys.path.append("used_functions/")
import custom_functions as custfun
import importlib
importlib.reload(custfun)
import pandas as pd
from pandas.plotting import scatter_matrix
import numpy as np
from matplotlib import pyplot as plt
import matplotlib as mpl
import matplotlib.dates as mdates
from datetime import timedelta
import re
import warnings
%matplotlib inline
plt.style.use("ggplot")
The first dataset is an export of my ride data from Strava, an online social
network site for cycling and other sports. This data is a log of every ride since the start of 2018
and contains summary data like the distance and average speed. It was exported using
the script stravaget.py which uses the stravalib module to read data. Some details of
the fields exported by that script can be seen in the documentation for stravalib.
The exported data is a CSV file, so it's easy to read; however, the date information in the file is recorded in a different timezone (UTC), so we need to do a bit of conversion. In reading the data I'm setting the index of the data frame to be the datetime of the ride.
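Since the distinction matters here, a minimal sketch (on a toy timestamp, not the real data) of the difference between attaching a timezone with `tz_localize` and converting between zones with `tz_convert`:

```python
import pandas as pd

# Toy timestamp standing in for a ride date read from the CSV.
ts = pd.DatetimeIndex(["2018-01-01 03:00"])

# tz_localize attaches a zone to naive timestamps without changing the clock time;
# tz_convert shifts an already-localised timestamp into another zone.
as_utc = ts.tz_localize("UTC")
in_sydney = as_utc.tz_convert("Australia/Sydney")
print(in_sydney[0])  # 2018-01-01 14:00:00+11:00 (Sydney is UTC+11 in January)
```

If the timestamps in the file really are UTC wall-clock times, localising to UTC and then converting is the safe route; localising directly to Sydney only relabels them.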
# Loading strava_export.csv
strava = pd.read_csv('Portfolio1/data/strava_export.csv', index_col = 'date', parse_dates = True)
# Converting the timezone to Aus/Syd
strava.index = strava.index.tz_localize('Australia/Sydney')
print("Shape of the dataframe (rows, columns):", strava.shape)
strava.head()
The second dataset comes from an application called GoldenCheetah which provides some analytics services over ride data. This has some of the same fields but adds a lot of analysis of the power, speed and heart rate data in each ride. This data overlaps with the Strava data but doesn't include all of the same rides.
Again we create an index using the datetime for each ride, this time combining two columns in the data (date and time) and localising to Sydney so that the times match those for the Strava data.
# Loading the data from cheetah.csv
cheetah = pd.read_csv('Portfolio1/data/cheetah.csv', skipinitialspace = True)
# Setting combination of time and date as index
cheetah.index = pd.to_datetime(cheetah['date'] + ' ' + cheetah['time'])
# Converting the timezone to Aus/Syd
cheetah.index = cheetah.index.tz_localize('Australia/Sydney')
print("Shape of the dataframe (rows, columns):", cheetah.shape)
cheetah.head()
The GoldenCheetah data contains a great many variables (columns) and I won't go into all of them here. Some that are of particular interest for the analysis below are:
Here are definitions of some of the more important fields in the data. Capitalised fields come from the GoldenCheetah data while lowercase_fields come from Strava. There are many cases where fields are duplicated and in this case the values should be the same, although there is room for variation as the algorithm used to calculate them could be different in each case.
Some of the GoldenCheetah parameters are defined in their documentation.
Your first task is to combine these two data frames using the join method of Pandas. The goal is to keep only those rows of data
that appear in both data frames so that we have complete data for every row.
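As a minimal sketch of the inner join being asked for (toy frames and hypothetical values, not the real data): `join` matches on the index, `how="inner"` keeps only the datetimes present in both frames, and `lsuffix`/`rsuffix` disambiguate any column names the two frames share.

```python
import pandas as pd

# Toy stand-ins for the Strava and GoldenCheetah frames, both indexed by ride datetime.
idx_a = pd.to_datetime(["2018-01-01 08:00", "2018-01-02 08:00", "2018-01-03 08:00"])
idx_b = pd.to_datetime(["2018-01-02 08:00", "2018-01-03 08:00", "2018-01-04 08:00"])
strava_toy = pd.DataFrame({"distance": [30.0, 45.5, 20.2]}, index=idx_a)
cheetah_toy = pd.DataFrame({"TSS": [55, 80, 40]}, index=idx_b)

# An inner join keeps only the datetimes present in BOTH frames.
# If the frames share column names, join() raises unless lsuffix/rsuffix are given.
combined = strava_toy.join(cheetah_toy, how="inner", lsuffix="_strava", rsuffix="_cheetah")
print(combined.shape)  # (2, 2) — only Jan 2 and Jan 3 survive
```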
# Carrying out an inner join on both dataframes
## strava_plus_cheetah = strava.join(cheetah, how = "inner")
## The join above was not working, so the pre-joined CSV file is read instead
strava_plus_cheetah = pd.read_csv("Portfolio1/data/strava_plus_cheetah.csv", parse_dates = True)
# Setting an index
strava_plus_cheetah.index = strava_plus_cheetah["index"]
# Deleting the redundant index column and clearing the index name
del strava_plus_cheetah["index"]
strava_plus_cheetah.index.name = None  # `del df.index.name` breaks in newer pandas
print("Shape of the dataframe (rows, columns):", strava_plus_cheetah.shape)
strava_plus_cheetah.head()
# Creating a boolean flag based on the condition that device_watts is not False
flag = strava_plus_cheetah["device_watts"] != False
# Filtering data based on boolean flag
strava_plus_cheetah = strava_plus_cheetah[flag]
print("Shape of the dataframe (rows, columns):", strava_plus_cheetah.shape)
## strava_plus_cheetah = strava_plus_cheetah.drop(strava_plus_cheetah[~flag].index)  # Another way to achieve the above task (note ~flag: Python's `not` does not work on a Series)
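A boolean flag like the one above is a pandas Series, so it is inverted with `~` rather than Python's `not`; a tiny self-contained illustration on toy values:

```python
import pandas as pd

df = pd.DataFrame({"device_watts": [True, False, True]})
flag = df["device_watts"] != False  # keeps the notebook's comparison style

kept = df[flag]       # rows where device_watts is not False
dropped = df[~flag]   # `not flag` would raise ValueError on a Series; use ~
print(len(kept), len(dropped))  # 2 1
```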
mpl.rcParams['axes.titlesize'] = 22 # -Change title size of a plot
mpl.rcParams['axes.labelsize'] = 16 # -Change label size(x and y) of a plot
mpl.rcParams['xtick.labelsize'] = 15 # -Change xticks size of a plot
mpl.rcParams['ytick.labelsize'] = 15 # -Change yticks size of a plot
# For more details about this function, check used_functions/custom_functions.py
custfun.create_histogram_plus_boxplot(strava_plus_cheetah["moving_time"], "Moving time", "blue",
"Rides (count)", "Time (minutes)", (10, 10))
Looking at the histogram and boxplot, one can say that the distribution is not normal; in fact it is right-skewed (the right part of the box is longer than the left in the boxplot). It also looks bimodal (check the density plot below): there are two peaks in the histogram, one near the median at approximately 65 minutes and another at approximately 135 minutes. The boxplot, which displays the five-number summary, also tells us the following:
No outliers are detected
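The skewness and outlier claims made for these plots can also be checked numerically; a sketch on synthetic right-skewed data (illustrative only, not the ride data itself):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "moving time" values in minutes — illustrative only.
moving_time = pd.Series(rng.gamma(shape=2.0, scale=40.0, size=500))

print(round(moving_time.skew(), 2))  # positive => right-skewed

# Tukey's rule: points beyond 1.5*IQR from the quartiles are flagged as
# outliers, matching what the boxplot whiskers show.
q1, q3 = moving_time.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = moving_time[(moving_time < q1 - 1.5 * iqr) | (moving_time > q3 + 1.5 * iqr)]
print(len(outliers))
```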
# For more details about this function, check used_functions/custom_functions.py
custfun.create_histogram_plus_boxplot(strava_plus_cheetah["elapsed_time"], "Elapsed time", "salmon",
"Rides (count)", "Time (minutes)", (10, 10))
The distribution is not normal; in fact it is right-skewed (the right part of the box is longer than the left in the boxplot). It also looks bimodal (check the density plot below): there are two peaks in the histogram, one near the median at approximately 70 minutes and another at approximately 200 minutes. The boxplot, which displays the five-number summary, also tells us the following:
No outliers are detected
# Removing the rows where distance = 0 km
strava_plus_cheetah = strava_plus_cheetah[strava_plus_cheetah["distance"] != 0]
# For more details about this function, check used_functions/custom_functions.py
custfun.create_histogram_plus_boxplot(strava_plus_cheetah["distance"], "Distance", "mediumseagreen",
"Rides (count)", "Distance (km)", (10, 10))
The distribution is right-skewed (the right part of the box is longer than the left in the boxplot). It also looks bimodal (check the density plot below): there are two peaks in the histogram, one near the median at approximately 30 km and another at approximately 60 km. The boxplot, which displays the five-number summary, also tells us the following:
No outliers are detected
# For more details about this function, check used_functions/custom_functions.py
custfun.create_histogram_plus_boxplot(strava_plus_cheetah["Average Speed"], "Average Speed", "deepskyblue",
"Rides (count)", "Speed (km/h)", (10, 10))
The distribution is nearly normal (the two halves of the box are of almost equal length in the boxplot; check the density plot below). The boxplot, which displays the five-number summary, also tells us the following:
Outliers detected
# Removing rows where Average power = 0
strava_plus_cheetah = strava_plus_cheetah[strava_plus_cheetah["Average Power"] != 0]
# For more details about this function, check used_functions/custom_functions.py
custfun.create_histogram_plus_boxplot(strava_plus_cheetah["Average Power"], "Average Power", "slateblue",
"Rides (count)", "Power (watts)", (10, 10))
The distribution is normal (check the density plot below). The boxplot, which displays the five-number summary, also tells us the following:
Outliers detected
# For more details about this function, check used_functions/custom_functions.py
custfun.create_histogram_plus_boxplot(strava_plus_cheetah["TSS"], "Training stress score", "mediumorchid",
"Rides (count)", "Training Stress Score", (10, 10))
The distribution is right-skewed (the right part of the box is longer than the left in the boxplot; check the density plot below). The boxplot, which displays the five-number summary, also tells us the following:
Outliers detected
Plotting them together helps us compare the characteristics of the variables with one another.
## Histograms
# Selecting required columns
new_dataframe = strava_plus_cheetah[["distance", "moving_time", "elapsed_time" , "Average Speed",
"Average Power", "TSS"]]
# Plotting a histogram
new_dataframe.hist(figsize = (10, 10), edgecolor = "black", layout = (3, 3))
plt.show()
## We are only interested in the distributions of the variables; that's why labels and units are not displayed on the x-axis.
## Density plots show density curves, which make it easy to identify the distribution of each variable.
# Plotting density curves
new_dataframe.plot(kind='density', subplots = True, layout = (3, 3),
figsize = (15, 10), sharex = False, sharey = False)
# Spaces between subplots
plt.tight_layout()
plt.show()
## We are only interested in the distributions of the variables; that's why labels and units are not displayed on the x-axis.
## Boxplots are also useful for identifying how much variation a variable has.
# Plotting the boxplots
new_dataframe.plot(kind='box', subplots = True, layout = (3, 3),
figsize = (10,10), sharex = False, sharey = False)
# Spaces between subplots
plt.tight_layout()
plt.show()
## We are only interested in the variation of the variables; that's why labels and units are not displayed on the y-axis.
# Suppressing warnings for this cell
warnings.simplefilter(action = 'ignore')
# Removing rows where Avg heart rate = 0
strava_plus_cheetah = strava_plus_cheetah[strava_plus_cheetah["Average Heart Rate"] != 0.0]
# Getting required variables
variables_in_que3 = strava_plus_cheetah[["distance", "moving_time", "Average Speed", "Average Heart Rate",
"Average Power","NP", "TSS", "elevation_gain"]]
# Cleaning elevation_gain
# For more details about this function, check used_functions/custom_functions.py
variables_in_que3["elevation_gain"] = custfun.clean_elevation_gain(variables_in_que3["elevation_gain"])
# Creating correlation matrix
# For more details about this function, check used_functions/custom_functions.py
custfun.crete_correlogram(df = variables_in_que3, size = (10, 10), xlab_rotation = 45)
# For more details about this function, check used_functions/custom_functions.py
custfun.create_scatter_matrix(variables_in_que3, xlab_rotation = 45, ylab_rotation = 60,
xticklab_fontsize = 12, yticklab_fontsize = 12)
# Selecting important variables
important_dataframe = strava_plus_cheetah[["distance", "moving_time", "elapsed_time",
"elevation_gain", "Gradient", "Average Speed", "Average Power","Average Heart Rate",
"Average Cadence", "Calories (HR)", "TSS", "kudos", "workout_type"]]
# Removing rows where heart rate is 0
important_dataframe = important_dataframe[important_dataframe["Average Heart Rate"] != 0]
# Cleaning elevation_gain
# For more details about this function, check used_functions/custom_functions.py
important_dataframe["elevation_gain"] = custfun.clean_elevation_gain(important_dataframe["elevation_gain"]) # -See documentation above
def process_workout_type():
    """
    Create a new list of integer codes based on workout_type,
    where:
        1 : Ride
        2 : Workout
        3 : Race (anything that is neither Ride nor Workout)
    This is done to help plot graphs based on category.
    """
    temp_list = []
    for i in important_dataframe["workout_type"]:
        if i == "Ride":
            temp_list.append(1)
        elif i == "Workout":
            temp_list.append(2)
        else:
            temp_list.append(3)
    return temp_list
important_dataframe["workout_type_int"] = process_workout_type()
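For reference, the same encoding can be done without an explicit loop using `Series.map`; a small sketch on toy values:

```python
import pandas as pd

workout = pd.Series(["Ride", "Workout", "Race", "Ride"])

# Same encoding as process_workout_type(), done vectorised; anything that is
# neither "Ride" nor "Workout" falls through to 3, mirroring the else branch.
codes = workout.map({"Ride": 1, "Workout": 2}).fillna(3).astype(int)
print(codes.tolist())  # [1, 2, 3, 1]
```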
# For more details about this function, check used_functions/custom_functions.py
custfun.plots_with_workout(df = important_dataframe, kind = "scatter")
# For more details about this function, check used_functions/custom_functions.py
custfun.plots_with_workout(df = important_dataframe, kind = "box")
Kudos - Race and Ride receive more kudos than Workout, but there is more variation within Ride than within the other two categories.
Across all the variables we have looked at, Ride shows the most variation. This may be because Ride is by far the most common workout type (its count is higher than that of the other categories), and because people pay less attention to pacing and effort when they are just riding casually, which would explain the larger spread.
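The "more variation in Ride" observation can be quantified with a groupby; a sketch on hypothetical kudos values (not the real data):

```python
import pandas as pd

# Toy kudos values per workout type — hypothetical numbers for illustration.
df = pd.DataFrame({
    "workout_type": ["Ride", "Ride", "Ride", "Workout", "Workout", "Race", "Race"],
    "kudos": [2, 15, 40, 3, 4, 20, 25],
})

# Count and spread per category: a larger std means more variation.
summary = df.groupby("workout_type")["kudos"].agg(["count", "mean", "std"])
print(summary)
```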
# Deleting the workout_type column
del important_dataframe["workout_type"]
# For more details about this function, check used_functions/custom_functions.py
custfun.crete_correlogram(df = important_dataframe, size = (15, 15), corr_font_size = 16, xlab_rotation = 45)
# Creating a copy for a dataframe
temp = strava_plus_cheetah.copy()
# Creating a new column called date_new using index column of strava_plus_cheetah
temp["date_new"] = pd.to_datetime(strava_plus_cheetah.index)
# Grouping distance, TSS and Average speed by month
per_month_distance = temp.set_index('date_new').groupby(pd.Grouper(freq = 'M'))['distance'].sum()
per_month_tss = temp.set_index('date_new').groupby(pd.Grouper(freq = 'M'))['TSS'].sum()
per_month_avg_avg_speed = temp.set_index('date_new').groupby(pd.Grouper(freq = 'M'))['Average Speed'].mean()
# Creating a new DataFrame
per_month = pd.DataFrame()
per_month["Date"] = per_month_distance.index
per_month["Distance"] = per_month_distance.values
per_month["TSS"] = per_month_tss.values
per_month["Avg Avg Speed"] = per_month_avg_avg_speed.values
# Rounding up whole dataframe to 2 digits
per_month = per_month.round(2)
# Removing the timezone detail
per_month["Date"] = per_month['Date'].dt.tz_localize(None)
per_month
# Plotting an empty figure
plt.figure(figsize = (18, 18))
# Plotting three lines
p1, = plt.plot(per_month["Date"], per_month["Distance"], marker = 'o', label = "Distance" )
p2, = plt.plot(per_month["Date"], per_month["TSS"], marker = 's', label = "TSS")
p3, = plt.plot(per_month["Date"], per_month["Avg Avg Speed"], marker = '^', label = "Avg Avg Speed")
# Setting xlabel and title of the plot
plt.xlabel("Year - Month", fontsize = 16, fontweight = "bold")
plt.title("Distance, TSS and Average Speed grouped by month", fontsize = 18)
# Setting x-axis locator, formatter and xticklabels
plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m'))
plt.setp(plt.gca().get_xticklabels(),rotation = 90)
# Plotting legends for lines
lines = [p1, p2, p3]
plt.legend(lines, [l.get_label() for l in lines], fontsize = 12)
# Plotting values for every data point
for i in range(per_month.shape[0]):
    plt.text(per_month["Date"][i], per_month["Distance"][i], str(per_month["Distance"][i]),
             fontsize = 14, bbox = dict(facecolor = 'salmon', alpha = 0.6))
    plt.text(per_month["Date"][i], per_month["TSS"][i], str(per_month["TSS"][i]),
             fontsize = 14, bbox = dict(facecolor = 'skyblue', alpha = 0.6))
    # -55 -> just so that the value label won't overlap with the other elements
    plt.text(per_month["Date"][i], per_month["Avg Avg Speed"][i] - 55, str(per_month["Avg Avg Speed"][i]),
             fontsize = 14, bbox = dict(facecolor = 'mediumpurple', alpha = 0.3), rotation = 335)
plt.show()
# For more details about this function, check used_functions/custom_functions.py
custfun.create_lineplot_multiple_YAXIS(per_month, "Date", "Distance", "TSS", "Avg Avg Speed",
"Year-Month", "Km", "score", "Km/h", c1 = "red", c2 = "royalblue", c3 = "green")
We can observe the following patterns here:
# Creating a copy for a dataframe
temp = strava_plus_cheetah.copy()
# Creating a new column called date_new using index column of strava_plus_cheetah
temp["date_new"] = pd.to_datetime(strava_plus_cheetah.index)
# Grouping the distance by day
per_day_distance = temp.set_index('date_new').groupby(pd.Grouper(freq = 'D'))['distance'].sum()
# Creating a new dataframe
per_day = pd.DataFrame()
# Creating columns in new dataframe
per_day["date"] = per_day_distance.index
per_day["date"] = pd.to_datetime(per_day["date"])
per_day["distance"] = per_day_distance.values
# Rounding up whole dataframe to 2 digits
per_day = per_day.round(2)
# Distance travelled over a given month
## Try it yourself
month = 3
## For more details about this function, check used_functions/custom_functions.py
custfun.activityOverGivenMonth(per_day, 2018, month)
# Loading the libraries
import sys
sys.path.append("used_functions/")
import custom_functions as custfun
import importlib
importlib.reload(custfun)
import pandas as pd
from pandas.plotting import scatter_matrix
import numpy as np
from matplotlib import pyplot as plt
plt.style.use("ggplot")
import matplotlib.dates as mdates
import seaborn as sns
import matplotlib as mpl
from sklearn import linear_model
from sklearn.model_selection import KFold, RepeatedKFold, cross_val_score
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.feature_selection import RFE
# Reading the data
training = pd.read_csv("Portfolio2/training.csv")
testing = pd.read_csv("Portfolio2/testing.csv")
training["date"] = pd.to_datetime(training["date"])
testing["date"] = pd.to_datetime(testing["date"])
print(training.shape)
print(testing.shape)
# % data in training
print("% of data in training dataset : ", str(round(training.shape[0]/(training.shape[0] + testing.shape[0]) * 100))+"%")
# % data in testing
print("% of data in testing dataset : ", str(round(testing.shape[0]/(training.shape[0] + testing.shape[0]) * 100))+"%")
The data is partitioned into two parts (training 75% and testing 25%), as mentioned in the paper.
mpl.rcParams['axes.titlesize'] = 22 # -Change title size of a plot
mpl.rcParams['axes.labelsize'] = 16 # -Change label size(x and y) of a plot
mpl.rcParams['xtick.labelsize'] = 15 # -Change xticks size of a plot
mpl.rcParams['ytick.labelsize'] = 15 # -Change yticks size of a plot
# Plotting the figure
plt.figure(figsize=(18,8))
# Adding the line plot to figure
plt.plot(training["date"], training["Appliances"])
# Setting plot parameters like xtick locators and labels
plt.gca().xaxis.set_major_locator(mdates.MonthLocator())
plt.gca().set_xticklabels(["Feb","March", "April", "May", "June"])
plt.setp(plt.gca().get_xticklabels(),rotation = 0)
# Setting x-axis, y-axis labels and title of the plot
plt.xlabel("Months")
plt.ylabel("Appliances Wh")
plt.title("Appliances energy consumption measurement for the whole period")
plt.show()
We can observe that the appliances' consumption follows some kind of pattern: it shows low consumption before the beginning of February, then variable consumption until April, then low consumption again after April, and then the pattern repeats.
# Mapping string_to_int function to get weeks
data_weeks = pd.Series(map(custfun.string_to_int, training["date"].dt.strftime('%W')))
# Finding from which week data is starting
min_week = min(data_weeks)
# Filtering data of where week = min_week
flag_week1 = (data_weeks == min_week)
first_week = training.loc[flag_week1]
# Plotting the figure
plt.figure(figsize=(15,10))
# Adding the line plot to figure
plt.plot(first_week["date"], first_week["Appliances"])
# Setting plot parameters like xtick locators and labels
plt.gca().xaxis.set_major_locator(mdates.DayLocator())
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('Jan %d'))
plt.setp(plt.gca().get_xticklabels(),rotation = 0)
# Setting x-axis, y-axis labels and title of the plot
plt.xlabel("Days")
plt.ylabel("Appliances Wh")
plt.title("A closer look at the first week of data")
plt.show()
The appliances’ consumption profile is highly variable as seen in above plot, with periods of almost constant demand followed by high spikes.
# For more details about this function, check used_functions/custom_functions.py
custfun.create_histogram_plus_boxplot(training["Appliances"], "Appliances", "slateblue", "Counts", "Appliances(Wh)", (15,15))
The plot above shows a histogram and boxplot of the data. As can be seen, the distribution has a long tail. In the boxplot, the median is represented by a thick black line inside the blue rectangle and has a value of 60 Wh. The lower whisker has a value of 10 Wh and the upper whisker a value of 170 Wh. It also shows that the data above the median is more dispersed and that there are several outliers (marked with the round circles above the upper whisker).
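The whisker values quoted above follow the usual boxplot convention; a sketch of how they are computed, on synthetic long-tailed data rather than the real consumption values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Synthetic long-tailed consumption values (Wh) — illustrative, not the real data.
appliances = pd.Series(rng.lognormal(mean=4.1, sigma=0.6, size=1000))

q1, med, q3 = appliances.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1
# Matplotlib whiskers sit at the most extreme data points within 1.5*IQR of the box;
# anything beyond them is drawn as an outlier circle.
lower_whisker = appliances[appliances >= q1 - 1.5 * iqr].min()
upper_whisker = appliances[appliances <= q3 + 1.5 * iqr].max()
print(round(med, 1), round(lower_whisker, 1), round(upper_whisker, 1))
```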
# For more details about this function, check used_functions/custom_functions.py
custfun.create_scatter_matrix(training[["Appliances", "lights", "T1", "RH_1", "T2", "RH_2", "T3", "RH_3"]])
The plot shows that there is a positive correlation between the energy consumption of appliances and lights (0.19). The second-largest correlation is between appliances and T2. For the indoor temperatures, the correlations are high, as expected; for example, a positive correlation is found between T1 and T3.
# For more details about this function, check used_functions/custom_functions.py
custfun.create_scatter_matrix(training[["Appliances", "T4", "RH_4", "T5", "RH_5", "T6", "RH_6"]])
The plot shows that the highest correlation with the appliances is with the outdoor temperature T6 (0.12). There is also a negative correlation between the appliances and the outdoor humidity RH_6 (−0.09).
# For more details about this function, check used_functions/custom_functions.py
custfun.create_scatter_matrix(training[["Appliances", "T7", "RH_7", "T8", "RH_8", "T9", "RH_9"]])
The plot shows positive correlations between the consumption of appliances and T7, T8 and T9 (0.03, 0.05 and 0.02 respectively) and negative correlations with RH_7 (−0.06), RH_8 (−0.09) and RH_9 (−0.05). In general, the consumption of appliances correlates positively with the temperatures and negatively with the humidities.
# For more details about this function, check used_functions/custom_functions.py
custfun.create_scatter_matrix(training[["Appliances", "T_out", "Press_mm_hg", "RH_out", "Windspeed",
"Visibility", "Tdewpoint", "NSM", "T6"]], ylab_rotation = 45)
The plot shows the highest correlation between the energy consumption of appliances and NSM, with a value of 0.22. A positive correlation of 0.10 is seen between the appliances' consumption and the outdoor temperature (T_out): the higher the temperature, the higher the energy use by the appliances. There is also a positive correlation between the appliances' consumption and wind speed (0.09): higher wind speeds correlate with higher energy consumption by the appliances. A negative correlation of −0.15 was found with RH_out, and of −0.03 with pressure. Another important and interesting correlation is between pressure and wind speed; this relationship is negative (−0.23): the lower the pressure, the higher the wind speed tends to be.
# Getting different dataframes for first four weeks from the data
## For more details about this function, check used_functions/custom_functions.py
first_week_appliances_df = custfun.create_weekdata(training, 1)
second_week_appliances_df = custfun.create_weekdata(training, 2)
third_week_appliances_df = custfun.create_weekdata(training, 3)
forth_week_appliances_df = custfun.create_weekdata(training, 4)
# We don't have full data for the first week in January; the data starts on Monday at the 17th hour.
first_week_appliances_df
I have printed the dataframe above just to show that there is no data before hour 17 on the first day of the data.
# Plotting the figure
fig = plt.figure(figsize = (20,20))
# Adding four subplots - each for each week
ax1 = fig.add_subplot(141)
ax2 = fig.add_subplot(142)
ax3 = fig.add_subplot(143)
ax4 = fig.add_subplot(144)
# Plotting heatmap for each week
sns.heatmap(first_week_appliances_df.sort_index(ascending = False), cmap = "YlOrRd", ax = ax1,
square = True, cbar_kws = {"shrink" : 0.45}, linewidths = 0.5)
sns.heatmap(second_week_appliances_df.sort_index(ascending = False), cmap = "YlOrRd", ax = ax2,
square = True, cbar_kws = {"shrink" : 0.45}, linewidths = 0.5)
sns.heatmap(third_week_appliances_df.sort_index(ascending = False), cmap = "YlOrRd", ax = ax3,
square = True, cbar_kws = {"shrink" : 0.45}, linewidths = 0.5)
sns.heatmap(forth_week_appliances_df.sort_index(ascending = False), cmap = "YlOrRd", ax = ax4,
square = True, cbar_kws = {"shrink" : 0.45}, linewidths = 0.5)
# Removing labels and ticks from the axes
ax1.set_ylabel('');ax1.set_xlabel('')
ax2.set_ylabel('');ax2.set_xlabel('');ax2.set_yticks([]);ax2.set_yticklabels([])
ax3.set_ylabel('');ax3.set_xlabel('');ax3.set_yticks([]);ax3.set_yticklabels([])
ax4.set_ylabel('');ax4.set_xlabel('');ax4.set_yticks([]);ax4.set_yticklabels([])
# Setting x-axis and y-axis label for plot
fig.text(0.5, 0.20, "Days in a Week", fontsize = 18, ha = 'center')
fig.text(0.1, 0.5, "Hours in a day", fontsize = 18, va = 'center', rotation = 'vertical')
plt.show()
The energy consumption starts to rise around 6 in the morning. Then, around noon, there are energy load surges. The energy demand also increases around 6 pm. There is no clear pattern with respect to the day of the week.
# Taking all the columns as mentioned in the paper
training_modified_1 = training.copy()
training_modified_1 = training_modified_1.iloc[:, 1:]
training_modified_1["WeekStatus"] = training_modified_1["WeekStatus"].astype("category")
training_modified_1["Day_of_week"] = training_modified_1["Day_of_week"].astype("category")
# Getting categorical columns and convert them to codes
categorical_cols = training_modified_1.select_dtypes(['category']).columns
training_modified_1[categorical_cols] = training_modified_1[categorical_cols].apply(lambda x: x.cat.codes)
# Taking all the columns as mentioned in the paper
testing_modified_1 = testing.copy()
testing_modified_1 = testing_modified_1.iloc[:, 1:]
testing_modified_1["WeekStatus"] = testing_modified_1["WeekStatus"].astype("category")
testing_modified_1["Day_of_week"] = testing_modified_1["Day_of_week"].astype("category")
# Getting categorical columns and convert them to codes
categorical_cols = testing_modified_1.select_dtypes(['category']).columns
testing_modified_1[categorical_cols] = testing_modified_1[categorical_cols].apply(lambda x: x.cat.codes)
# Taking x(independent) and y(dependent) variables for models, train for training and test for testing
x_train = training_modified_1.iloc[:, 1:]
y_train = training_modified_1.iloc[:, 0]
x_test = testing_modified_1.iloc[:, 1:]
y_test = testing_modified_1.iloc[:, 0]
# Creating the linear model
paper_model = linear_model.LinearRegression()
# Fitting the training data
paper_model.fit(x_train, y_train)
# Predicting the values for training data
predicted_train = paper_model.predict(x_train)
# Calculating various metrics for evaluation purposes
rmse_train = mean_squared_error(y_train, predicted_train) ** 0.5
mae_train = mean_absolute_error(y_train, predicted_train)
## For more details about this function, check used_functions/custom_functions.py
mape_train = custfun.mean_absolute_percentage_error(y_train, predicted_train)
r2_train = r2_score(y_train, predicted_train)
# Printing metrics given in paper
print("RMSE(Root Mean Squared Error):", round(rmse_train, 2))
print("R Squared:", round(r2_train, 2))
print("MAE(Mean Absolute Error):", round(mae_train, 2))
print("MAPE(Mean Absolute Percentage Error):", round(mape_train, 2))
# Predicting the values for testing data
predicted_test = paper_model.predict(x_test)
# Calculating various metrics for evaluation purposes
rmse_test = mean_squared_error(y_test, predicted_test) ** 0.5
mae_test = mean_absolute_error(y_test, predicted_test)
## For more details about this function, check used_functions/custom_functions.py
mape_test = custfun.mean_absolute_percentage_error(y_test, predicted_test)
r2_test = r2_score(y_test, predicted_test)
# Printing metrics
print("RMSE(Root Mean Squared Error):", round(rmse_test, 2))
print("R Squared:", round(r2_test, 2))
print("MAE(Mean Absolute Error):", round(mae_test, 2))
print("MAPE(Mean Absolute Percentage Error):", round(mape_test, 2))
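The MAPE helper lives in `custom_functions.py` and its source is not shown here; as a point of reference, this is the standard definition it presumably implements (an assumption, not the actual helper):

```python
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    """Standard MAPE in percent — a guess at what the helper in
    custom_functions.py computes (its source is not shown here)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

print(round(mean_absolute_percentage_error([100, 200], [110, 180]), 1))  # 10.0
```

Note that this definition divides by `y_true`, so it is undefined when the target contains zeros.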
# Plotting the figure
plt.figure(figsize = (10,10))
# Adding the scatter plot of residuals vs original value
plt.scatter(y_test, (y_test - predicted_test), c = "black", edgecolors = "white", marker = "o")
# Setting x-axis, y-axis label and title of the plot
plt.xlabel("Appliances Wh")
plt.ylabel("Residuals")
plt.title("Residuals plot")
plt.show()
The plot above shows the residuals for the linear regression model, computed as the difference between the real values and the predicted values. It is clear that the relationship between the variables and the energy consumption of appliances is not well represented by the linear model, since the residuals are not randomly distributed around the horizontal axis.
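The asymmetry of the residuals can also be checked numerically with a skewness statistic; a sketch on synthetic heavy-tailed residuals (for normally distributed residuals the statistic is close to 0):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic residuals with a heavy right tail, mimicking an under-fitting model.
residuals = rng.gamma(shape=2.0, scale=30.0, size=500) - 60.0

# Sample skewness: third central moment over the cubed standard deviation.
skew = ((residuals - residuals.mean()) ** 3).mean() / residuals.std() ** 3
print(round(skew, 1))  # clearly positive => asymmetric residuals
```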
# Creating new dictionary for creating dataframe from it
dictionary = {}
# Some lists to create the columns in dataframe
variables_chosen, no_variables_chosen = [], []
rmse, mae, mape, r2 = [], [], [], []
# Looping over different numbers of variables for RFE
for i in range(len(x_train.columns), 0, -1):
    # Creating the RFE
    rfe = RFE(paper_model, n_features_to_select = i, step = 1)
    # Fitting the RFE
    rfe_model = rfe.fit(x_train, y_train)
    # Predicting using the RFE model
    predicted = rfe_model.predict(x_test)
    # Appending some information to the lists
    no_variables_chosen.append(i)
    variables_chosen.append(", ".join(x_train.columns.values[list(rfe_model.support_)]))
    rmse.append(mean_squared_error(y_test, predicted) ** 0.5)
    mae.append(mean_absolute_error(y_test, predicted))
    ## For more details about this function, check used_functions/custom_functions.py
    mape.append(custfun.mean_absolute_percentage_error(y_test, predicted))
    r2.append(r2_score(y_test, predicted))
# Adding keys and values to dictionary
dictionary["variables_chosen"] = variables_chosen
dictionary["no_variables_chosen"] = no_variables_chosen
dictionary["rmse"] = rmse
dictionary["mae"] = mae
dictionary["mape"] = mape
dictionary["r2"] = r2
# Creating dataframe from the dictionary
model_information = pd.DataFrame.from_dict(dictionary)
# Sorting dataframe based on r2 values
model_information = model_information.sort_values("r2", ascending = False)
# Plotting RMSE vs No of variables
# For more details about this function, check used_functions/custom_functions.py
custfun.plot_RMSE_VS_NoVar(df = model_information)
model_information.head()
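For the loop above, the fitted RFE object exposes which features survived via `support_` and the elimination order via `ranking_`; a small self-contained sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
# The target depends only on the first two columns; the rest are noise features.
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

rfe = RFE(LinearRegression(), n_features_to_select=2, step=1)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the kept features
print(rfe.ranking_)   # 1 for kept features; higher values were eliminated earlier
```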
# Creating new training dataframes for the experimental purposes
training_modified_2 = training.copy()
# Creating dummies instead of using categorical codes
dummies_weekstatus = pd.get_dummies(training_modified_2.loc[:, "WeekStatus"])
dummies_dow = pd.get_dummies(training_modified_2.loc[:, "Day_of_week"])
# Appending the dummies columns to dataframe
training_modified_2 = pd.concat([training_modified_2, dummies_weekstatus, dummies_dow], axis = 1)
# Dropping one dummy variable from each column as well as dropping categorical columns
training_modified_2 = training_modified_2.drop(["Weekend", "WeekStatus", "Day_of_week", "Monday"], axis = 1)
# Creating new column for hour
training_modified_2["hour"] = training_modified_2["date"].dt.hour
training_modified_2.head()
# Creating new testing dataframes for the experimental purposes
testing_modified_2 = testing.copy()
# Creating dummies instead of using categorical codes
dummies_weekstatus = pd.get_dummies(testing_modified_2.loc[:, "WeekStatus"])
dummies_dow = pd.get_dummies(testing_modified_2.loc[:, "Day_of_week"])
# Appending the dummies columns to dataframe
testing_modified_2 = pd.concat([testing_modified_2, dummies_weekstatus, dummies_dow], axis = 1)
# Dropping one dummy variable from each column as well as dropping categorical columns
testing_modified_2 = testing_modified_2.drop(["Weekend", "WeekStatus", "Day_of_week", "Monday"], axis = 1)
# Creating new column for hour
testing_modified_2["hour"] = testing_modified_2["date"].dt.hour
testing_modified_2.head()
# Taking x(independent) and y(dependent) variables for models, train for training and test for testing
train_x = training_modified_2.iloc[:, 2:]
test_x = testing_modified_2.iloc[:, 2:]
train_y = training_modified_2.iloc[:, 1]
test_y = testing_modified_2.iloc[:, 1]
# Creating linear regression model
tem_model = linear_model.LinearRegression()
# Fitting training data
tem_model.fit(train_x, train_y)
# Predicting the values for testing data
predicted_test = tem_model.predict(test_x)
# Calculating various metrics for evaluation purposes
rmse_test = mean_squared_error(test_y, predicted_test) ** 0.5
mae_test = mean_absolute_error(test_y, predicted_test)
## For more detials about this function check used_function/custom_functions.py
mape_test = custfun.mean_absolute_percentage_error(test_y, predicted_test)
r2_test = r2_score(test_y, predicted_test)
# Printing metrics
print("RMSE(Root Mean Squared Error):", round(rmse_test, 2))
print("R Squared:", round(r2_test, 2))
print("MAE(Mean Absolute Error):", round(mae_test, 2))
print("MAPE(Mean Absolute Percentage Error):", round(mape_test, 2))
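The MAPE helper lives in used_functions/custom_functions.py and is not shown in this notebook; the conventional definition it presumably implements is:

```python
import numpy as np

def mean_absolute_percentage_error(y_true, y_pred):
    # Mean of |actual - predicted| / |actual|, expressed as a percentage.
    # Undefined when any actual value is zero.
    y_true = np.asarray(y_true, dtype = float)
    y_pred = np.asarray(y_pred, dtype = float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
```

For example, `mean_absolute_percentage_error([100, 200], [110, 180])` gives 10.0 (errors of 10% and 10%, averaged).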
# Dictionary to collect results for a summary dataframe
dictionary = {}
# Some lists to create the columns in dataframe
variables_chosen, no_variables_chosen = [], []
rmse, mae, mape, r2 = [], [], [], []
# Looping over decreasing numbers of selected variables for RFE
for i in range(len(train_x.columns), 0, -1):
    # Creating the RFE selector
    rfe = RFE(tem_model, n_features_to_select = i, step = 1)
    # Fitting the RFE
    rfe_model = rfe.fit(train_x, train_y)
    # Predicting using the RFE model
    predicted = rfe_model.predict(test_x)
    # Appending the results to the lists
    no_variables_chosen.append(i)
    variables_chosen.append(", ".join(train_x.columns.values[list(rfe_model.support_)]))
    rmse.append(mean_squared_error(test_y, predicted) ** 0.5)
    mae.append(mean_absolute_error(test_y, predicted))
    ## For more details about this function check used_functions/custom_functions.py
    mape.append(custfun.mean_absolute_percentage_error(test_y, predicted))
    r2.append(r2_score(test_y, predicted))
# Adding keys and values to dictionary
dictionary["variables_chosen"] = variables_chosen
dictionary["no_variables_chosen"] = no_variables_chosen
dictionary["rmse"] = rmse
dictionary["mae"] = mae
dictionary["mape"] = mape
dictionary["r2"] = r2
# Creating dataframe from the dictionary
model_information_2 = pd.DataFrame.from_dict(dictionary)
# Sorting dataframe based on r2 values
model_information_2 = model_information_2.sort_values("r2", ascending = False)
# Plotting RMSE vs No of variables
# For more details about this function check used_functions/custom_functions.py
custfun.plot_RMSE_VS_NoVar(df = model_information_2)
model_information_2.head()
K-means clustering is one of the simplest and most popular unsupervised learning algorithms. Typically, unsupervised algorithms make inferences from datasets using only input vectors, without referring to known or labelled outcomes. This notebook illustrates the K-means clustering process by generating some random clusters of data and then showing the iterations of the algorithm as the randomly initialised cluster means are updated.
We first generate random data around 4 centers.
import sys
sys.path.append("used_functions/")
import importlib
import numpy as np
import pandas as pd
from tabulate import tabulate
import matplotlib as mpl
from matplotlib import pyplot as plt
plt.style.use("ggplot")
%matplotlib inline
import custom_functions as custfun
importlib.reload(custfun)
import warnings
warnings.filterwarnings('ignore')
import IPython
# Creating centers
center_1 = np.array([1,2])
center_2 = np.array([6,6])
center_3 = np.array([9,1])
center_4 = np.array([-5,-1])
# Generate random data and center it to the four centers each with a different variance
np.random.seed(5)
# Generating data randomly
data_1 = np.random.randn(200,2) * 1.5 + center_1
data_2 = np.random.randn(200,2) * 1 + center_2
data_3 = np.random.randn(200,2) * 0.5 + center_3
data_4 = np.random.randn(200,2) * 0.8 + center_4
# Concatenating the data arrays in one big array
data = np.concatenate((data_1, data_2, data_3, data_4), axis = 0)
# Plotting an empty figure
plt.figure(figsize = (10,7))
# Plotting data points
plt.scatter(data[:,0], data[:,1], c = "lightcoral")
# Plot title, xlabel and ylabel
plt.title("Scatter plot of the data")
plt.xlabel("X Coordinate")
plt.ylabel("Y Coordinate")
plt.show()
You need to generate four random centres. This part of the portfolio should contain at least the following: k is set to 4; the centres are generated as centres = np.random.randn(k,c)*std + mean, where std and mean are the standard deviation and mean of the data and c is the number of features in the data; the random seed is set to 6; the centres are coloured green, blue, yellow, and cyan, with edgecolors set to red.
# No of clusters: 4
k = 4
# No of features are 2
c = 2
# Calculating standard deviation and mean for the data
std = data.std()
mean = data.mean()
# Setting the random seed to 6 for reproducibility
np.random.seed(6)
# Creating centroids
centroids = np.random.randn(k,c) * std + mean
# Creating dataframe from centroids
centroid_df = pd.DataFrame(centroids)
centroid_df.index.name = "cluster"
# Colors for coordinates
colors = ["green", "blue", "yellow", "cyan"]
# Plotting an empty figure
plt.figure(figsize = (10,7))
# Plotting data points
plt.scatter(data[:,0], data[:,1], c = "lightcoral")
# Plotting cluster centroids
plt.scatter(centroids[:,0], centroids[:,1], c = colors, edgecolors = "red", s = 500)
# Putting centroid coordinates on the plot
custfun.put_coordinates(df = centroid_df, c = colors)
# Plotting the legend for the plot
patches = [ plt.plot([],[], marker = "o", ms = 10, ls = "", mec = None, color = colors[i],
label = "Cluster " + str(i))[0] for i in range(len(colors)) ]
plt.legend(handles = patches, loc = 'upper right', ncol = 1)
# Plot title, xlabel and ylabel
plt.title("Scatter plot of the data with initial centroids")
plt.xlabel("X Coordinate")
plt.ylabel("Y Coordinate")
# Saving plot for a slideshow
plt.savefig('Portfolio3/html/images/0.png', bbox_inches = "tight")
plt.show()
I. Initialization - The first step is to assign random centroids for the clusters. This is typically done by randomly choosing K (here 4) points from the input set (above, we created 4 centroids using the standard deviation and mean of the data).
II. Cluster Assignment - Each observation (data point) is assigned to a centroid such that the within-cluster sum of squares is minimised (equivalently, the between-cluster sum of squares is maximised). Various metrics can be used to measure similarity between data points; in practice this means assigning each point to its closest centroid. Here I have used the Euclidean distance, computed as follows:
$$ d(P, C) = \sqrt{(x_{1} - X)^2 + (y_{1} - Y)^2} $$ where $(x_{1}, y_{1})$ is an observation point($P$) and $(X, Y)$ is a centroid point($C$).
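Under that definition, the assignment step reduces to a nearest-centroid lookup. A small NumPy sketch with made-up sample points (not the notebook's data):

```python
import numpy as np

points = np.array([[1.0, 2.0], [6.0, 6.0]])      # sample observations (x1, y1)
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])   # sample centroids (X, Y)

# Broadcasting gives a (n_points, n_centroids) matrix of Euclidean distances
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis = 2)
# Index of the closest centroid for each point
nearest = dists.argmin(axis = 1)
```

Here point (1, 2) is closest to centroid (0, 0) and point (6, 6) to centroid (5, 5), so `nearest` is `[0, 1]`.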
III. Centroid Update - After all the observations have been assigned, each centroid is recomputed. In K-means the new centroid is the mean of the data points assigned to it in step II. [You can see the centroids being pulled towards the data points.]
IV. Convergence - Steps II and III are repeated until the algorithm converges or some stopping criterion is reached. There are several ways to detect convergence; the most common, used here, is to run until no observation changes cluster membership.
Extras - K-means often converges to a local minimum rather than the globally optimal clustering. To mitigate this, K-means is usually run several times, each time with a different set of initial random centroids, and the best result is kept.
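Putting steps II-IV together, a minimal self-contained K-means loop might look like the sketch below (independent of the custfun.kmean_iteration helper used in the cells that follow, and assuming no cluster ever ends up empty):

```python
import numpy as np

def kmeans(data, centroids, max_iter = 100):
    # data: (n, d) array; centroids: (k, d) array of initial centres (step I)
    for _ in range(max_iter):
        # Step II: assign each point to its nearest centroid (Euclidean)
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis = 2)
        labels = dists.argmin(axis = 1)
        # Step III: recompute each centroid as the mean of its assigned points
        # (assumes every cluster keeps at least one point)
        new_centroids = np.array([data[labels == k].mean(axis = 0)
                                  for k in range(len(centroids))])
        # Step IV: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

The cells below do the same thing one iteration at a time, so each intermediate state can be plotted for the slideshow.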
# Wrapping the data and initial centroids in dataframes for the iteration helper
data_df = pd.DataFrame(data)
current_centroids = pd.DataFrame(centroids)
current_centroids.index.name = "cluster"
# Empty dataframes to hold the previous centroids and the cluster assignments
previous_centroids = pd.DataFrame()
clusters = pd.DataFrame()
# Checking if both centroids are same or not
print("Centroids are same ->", current_centroids.equals(previous_centroids))
# For more details about this function check used_functions/custom_functions.py
current_centroids, previous_centroids = custfun.kmean_iteration(data = data_df,
current_centroids_df = current_centroids,
previous_centroids_df = previous_centroids)
# Checking if both centroids are same or not
print("Centroids are same ->", current_centroids.equals(previous_centroids))
# For more details about this function check used_functions/custom_functions.py
current_centroids, previous_centroids = custfun.kmean_iteration(data = data_df,
current_centroids_df = current_centroids,
previous_centroids_df = previous_centroids)
# Checking if both centroids are same or not
print("Centroids are same ->", current_centroids.equals(previous_centroids))
# For more details about this function check used_functions/custom_functions.py
current_centroids, previous_centroids = custfun.kmean_iteration(data = data_df,
current_centroids_df = current_centroids,
previous_centroids_df = previous_centroids)
# Checking if both centroids are same or not
print("Centroids are same ->", current_centroids.equals(previous_centroids))
# For more details about this function check used_functions/custom_functions.py
current_centroids, previous_centroids = custfun.kmean_iteration(data = data_df,
current_centroids_df = current_centroids,
previous_centroids_df = previous_centroids)
# Checking if both centroids are same or not
print("Centroids are same ->", current_centroids.equals(previous_centroids))
# For more details about this function check used_functions/custom_functions.py
current_centroids, previous_centroids = custfun.kmean_iteration(data = data_df,
current_centroids_df = current_centroids,
previous_centroids_df = previous_centroids)
# Checking if both centroids are same or not
print("Centroids are same ->", current_centroids.equals(previous_centroids))
# For more details about this function check used_functions/custom_functions.py
current_centroids, previous_centroids = custfun.kmean_iteration(data = data_df,
current_centroids_df = current_centroids,
previous_centroids_df = previous_centroids)
# Checking if both centroids are same or not
print("Centroids are same ->", current_centroids.equals(previous_centroids))
# The slideshow.html file can be found in the html folder of this portfolio
IPython.display.HTML(filename = 'slideshow.html')